Fast Phonetic Similarity Search over Large Repositories

Authors

  • Hegler Tissot
  • Gabriel Peschl
  • Marcos Didonet Del Fabro
Abstract

Information systems in many domains now produce large amounts of unstructured data, and these sources may be analyzed for different purposes. Existing approaches use string similarity methods, supported by a dictionary, to search for valid words within a text. However, they have two main drawbacks. First, they are not rich enough to encode phonetic information to assist the search. Second, they may be inefficient in the presence of spelling errors. In this paper, we present a novel approach for efficiently performing phonetic similarity search over large data sources. We present a data structure called PhoneticMap, which encodes language-specific phonetic information. The phonetic maps are used by a novel fast similarity search algorithm to find words with spelling errors. We validate our approach through an experiment that automatically corrects misspelled words in a data set built from a Portuguese variant of a well-known repository.
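
The abstract does not describe the internals of PhoneticMap or of the search algorithm, so the following is only a minimal sketch of the general idea of phonetic lookup with a supporting dictionary. It assumes a deliberately crude, hypothetical set of sound classes (not the paper's language-specific encoding, and not Portuguese phonetics), and the names `phonetic_key`, `build_index`, and `suggest` are illustrative only. Dictionary words are grouped by a coarse phonetic key, and a misspelled word is compared only against words that share its key.

```python
from collections import defaultdict

# Hypothetical, crude sound classes for illustration only.
# This is NOT the PhoneticMap encoding from the paper.
_CLASSES = {
    "b": "1", "p": "1", "f": "1", "v": "1",
    "c": "2", "k": "2", "q": "2", "g": "2",
    "s": "2", "z": "2", "x": "2", "j": "2",
    "d": "3", "t": "3",
    "l": "4", "r": "4",
    "m": "5", "n": "5",
}


def phonetic_key(word):
    """Encode mapped consonants as class digits; skip all other characters.

    A run of consonants from the same class yields a single digit; any
    unmapped character (for example a vowel) ends the run.
    """
    key = []
    prev = None
    for ch in word.lower():
        code = _CLASSES.get(ch)
        if code is not None and code != prev:
            key.append(code)
        prev = code
    return "".join(key)


def edit_distance(a, b):
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_index(words):
    """Group dictionary words by their phonetic key."""
    index = defaultdict(set)
    for w in words:
        index[phonetic_key(w)].add(w)
    return index


def suggest(word, index):
    """Return dictionary words sharing the key, closest spelling first."""
    candidates = index.get(phonetic_key(word), set())
    return sorted(candidates, key=lambda w: (edit_distance(word, w), w))


# Toy dictionary; a real one would be far larger.
index = build_index(["casa", "mesa", "pato", "gato"])
print(suggest("kaza", index))  # ['casa']
```

Because candidates are fetched by key, the edit distance is computed only for a small bucket of words rather than for the whole dictionary, which is the usual motivation for this kind of index.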

Similar resources

Effective and efficient similarity search in scientific workflow repositories

Scientific workflows have become a valuable tool for large-scale data processing and analysis. This has led to the creation of specialized online repositories to facilitate workflow sharing and reuse. Over time, these repositories have grown to sizes that call for advanced methods to support workflow discovery, in particular for similarity search. Effective similarity search requires both high ...

Large Scale Machine Learning, Jan 18, 2016, Lecture 5: Large-Scale Search: Locality Sensitive Hashing (LSH)

Nowadays, there exist hundreds of millions of images online. These images are either stored in web pages, or databases of companies, such as Facebook, Flickr, etc. It is challenging to quickly find similar images from these huge repositories. This is because: • The repositories are huge. Facebook has around 10 billion images [2]. These images have different resolution, dimension. • Images are v...

m3 - A Behavioral Similarity Metric for Business Processes

With the increasing uptake of business process management, companies maintain large scale process repositories consisting of hundreds or thousands of process models. So far, discovery within these repositories is limited to free text search or folder navigation. In a separate stream of research, similarity measures were introduced to get a better understanding of the relationships between proce...

Metric Trees for Efficient Similarity Search in Large Process Model Repositories

Due to the increasing adoption of business process management and the key role of process models, companies are setting up and maintaining large process model repositories. Repositories containing hundreds or thousands of process models are not uncommon, whereas only simplistic search functionality, such as text based search or folder navigation, is provided, today. On the other hand, advanced ...

Optimal Distance Bounds on Time-Series Data

Most data mining operations include an integral search component at their core. For example, the performance of similarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only for storing the data in a compressed form, but also for facilitati...

Publication date: 2014